perm filename TTT[4,KMC]1 blob
sn#021293 filedate 1973-01-24 generic text, type T, neo UTF8
00100 HOW TO USE AND HOW NOT TO USE TURING-LIKE TESTS
00200 IN EVALUATING THE ADEQUACY OF SIMULATION MODELS
00300 K.M. COLBY AND F.D. HILF
00400
00500 It is very easy to become confused about Turing's imitation
00600 game. In part this is due to Turing himself, who introduced the
00700 game in his 1950 paper COMPUTING MACHINERY AND INTELLIGENCE [3].
00800 A careful reading of this paper reveals that two games are
00900 actually proposed, the second of which is commonly called
01000 Turing's test.
01100 In the first imitation game two groups of judges,
01200 using teletyped interviews, try to determine which of two
01300 interviewees is a woman. Each judge is initially informed that o[...]
[...]e rejected, especially since
09200 statistical tests are biased in favor of rejecting the null
09300 hypothesis [2]. Yet this answer does not tell us what we would most
09400 like to know, i.e. how to improve the model. Simulation models do not
09500 spring forth in a complete, final and zero-defect form; they must be
09600 gradually developed over time. Perhaps we might obtain a "yes" answer
09700 to the machine-question if we allowed a large number of expert judges
09800 to conduct the interviews themselves rather than studying transcripts
09900 of other interviewers. Such an answer would indicate that the model
10000 must be improved, but unless we systematically investigated how the
10100 judges succeeded in making the discrimination, we would not know
10200 what aspects of the model to work on. The logistics of such a design
10300 are immense, and obtaining a large N of judges for sound statistical
10400 inference would require an effort disproportionate to the information-yield.
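To give a rough sense of what "a large N" means, the following sketch computes how many judges would be needed for a 95% confidence interval on the proportion of correct identifications to have a given half-width. It assumes a normal approximation and the worst case p = 0.5. The function name judges_needed and the margins tried are illustrative assumptions, not part of the study.

```python
import math

def judges_needed(margin, z=1.96, p=0.5):
    """Smallest N for which a normal-approximation interval on a
    proportion p has half-width <= margin (assumed formula)."""
    return math.ceil(z * z * p * (1.0 - p) / (margin * margin))

# Illustrative margins only; the study does not state a target precision.
for margin in (0.10, 0.05):
    print(margin, judges_needed(margin))
```

Halving the margin roughly quadruples the number of judges, which is the sense in which the effort grows out of proportion to the information gained.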
10500 A more efficient and informative way to use Turing-like tests
10600 is to ask judges to make ordinal ratings along scaled dimensions from
10700 teletyped interviews. We shall term this approach asking the
10800 dimension-question. One can then compare scaled ratings received by
10900 the patients and by the model to precisely determine where and by how much they
11000 differ. Model builders strive for a model which shows
11100 indistinguishability along some dimensions and distinguishability
11200 along others. That is, the model converges on what it is supposed to
11300 simulate and diverges from that which it is not.
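The comparison of scaled ratings described above might be sketched as follows. The ratings, the 0-9 scale, and the name rating_gaps are hypothetical and are not data from the study. Averaging ordinal ratings is a simplification, and no significance test is attempted here.

```python
from statistics import mean

# Hypothetical ordinal ratings (0-9 scale assumed), NOT data from the study.
patient_ratings = {
    "anger": [3, 4, 2, 5],
    "delusions": [7, 6, 8, 7],
}
model_ratings = {
    "anger": [6, 5, 7, 6],
    "delusions": [4, 5, 3, 4],
}

def rating_gaps(patients, model):
    """Return (dimension, model mean minus patient mean), largest gap first."""
    gaps = [(dim, mean(model[dim]) - mean(patients[dim])) for dim in patients]
    return sorted(gaps, key=lambda g: abs(g[1]), reverse=True)

for dim, gap in rating_gaps(patient_ratings, model_ratings):
    print(f"{dim}: {gap:+.2f}")
```

A positive gap means the model was rated higher than the patients on that dimension and a negative gap means lower, so the sorted list points the model builder to the dimensions most in need of work.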
11400 We mailed paired-interview transcripts to another 100
11500 randomly selected psychiatrists asking them to rate the responses of
11600 the two `patients' along certain dimensions. The judges were divided
11700 into groups, each judge being asked to rate responses of each I-O
11800 pair in the interviews along four dimensions. The total number of dimensions in this
11900 test was twelve: linguistic noncomprehension, thought disorder,
12000 organic brain syndrome, bizarreness, anger, fear, ideas of reference,
12100 delusions, mistrust, depression, suspiciousness and mania. These are
12200 dimensions which psychiatrists commonly use in evaluating patients.
12300 Table 1
12400 Table 1 shows there were significant differences, with the
12500 model receiving higher scores along the dimensions of linguistic
12600 noncomprehension, bizarreness, anger, mistrust and suspiciousness. On
12700 the dimension of delusions the patients were rated higher. There were
12800 no significant differences along the dimensions of organic brain
12900 syndrome, fear, ideas of reference, depression and mania.
13000 While tests asking the machine-question indicate
13100 indistinguishability at the gross level, a study of the finer
13200 structure of the model's behavior through ratings along scaled
13300 dimensions shows statistically significant differences between
13400 patients and model. These differences are of help to the model
13500 builder in suggesting which aspects of the model must be modified and
13600 improved in order to be considered an adequate simulation of the
13700 class of paranoid patients it is intended to simulate. For example,
13800 it is clear that the language-comprehension of the model must be
13900 improved. Once this has been implemented, a future test will tell us
14000 whether improvement has occurred and by how much in comparison to the
14100 earlier version. Successive identification of particular areas of
14200 failure in the model permits their improvement and the development of
14300 more adequate model-versions.
14400 Further evidence that the machine-question is an insensitive
14500 test appears in Table 2. In this test we constructed a random version
14600 of the paranoid model which utilized the output statements of the
14700 original model but expressed them randomly no matter what the
14800 interviewer said. Two psychiatrists conducted interviews with this
14900 model, transcripts of which were paired with patient interviews and
15000 sent to 200 randomly selected psychiatrists asking both the
15100 machine-question and the dimension-question. Replies were so few to
15200 the first mailing of 100 that another mailing was needed to achieve
15300 the required N - another fact to ponder. Of the 69 replies, 34 (49%)
15400 were right and 35 (51%) wrong. Based on this random sample of 69
15500 psychiatrists we are 95% confident that between 39% and 63% of all
15600 psychiatrists could make the correct identification, again indicating
15700 a chance level. However, as shown in Table 2, definite differences
15800 appear along the dimensions of linguistic noncomprehension, thought
15900 disorder (get other dimensions from table). On these particular
16000 dimensions we can construct a continuum in which the random version
16100 represents one extreme, the actual patients another. Our (nonrandom)
16200 model lies somewhere between these two extremes, indicating that it
16300 performs significantly better than the random version but still
16400 requires improvement before being indistinguishable from patients. In
16500 other words, this approach provides yardsticks for measuring the adequacy of this or
16600 any other dialogue simulation model along the relevant dimensions.
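The confidence interval on the proportion of correct identifications can be approximated as in the sketch below. It assumes a normal (Wald) approximation. The source does not say which method the authors used, so the bounds computed here need not coincide exactly with the 39% and 63% quoted above. The function name wald_interval is an illustrative assumption.

```python
import math

def wald_interval(successes, n, z=1.96):
    """Normal-approximation (Wald) 95% interval for a proportion."""
    p = successes / n
    half_width = z * math.sqrt(p * (1.0 - p) / n)
    return p - half_width, p + half_width

# 34 correct identifications out of 69 replies, as reported in the text.
low, high = wald_interval(34, 69)
print(f"{low:.3f} to {high:.3f}")
```

An interval that contains 50% is what is meant here by a chance level: the replies give no grounds for concluding that judges identify the model better than a coin toss would.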
16700 We conclude that when model builders want to conduct tests
16800 which indicate in which direction progress lies and to obtain a
16900 measure of whether progress is being achieved, the way to use
17000 Turing-like tests is to ask expert judges to make ratings along
17100 multiple dimensions considered essential to the model. Useful tests
17200 do not prove a model; they probe it for its sensitivities. Simply
17300 asking the machine-question yields no information about improving
17400 what the model builder knows is only a first approximation. His main
17500 problem is then how to get on with it.
17600
17700
17800 REFERENCES
18000 [1] Colby, K.M., Hilf, F.D., Weber, S. and Kraemer, H.C. Turing-like
18100 indistinguishability tests for the validation of a computer
18200 simulation of paranoid processes. ARTIFICIAL INTELLIGENCE, 3
18300 (1972), 199-221.
18400 [2] Meehl, P.E. Theory testing in psychology and physics: a
18500 methodological paradox. PHILOSOPHY OF SCIENCE, 34 (1967), 103-115.
18800 [3] Turing, A. Computing machinery and intelligence. Reprinted in:
18900 COMPUTERS AND THOUGHT (Feigenbaum, E.A. and Feldman, J., eds.).
19000 McGraw-Hill, New York, 1963, pp. 11-35.